IS643 Project - Machine Learning - James Quacinella

Intro - Why Use ML?

For this part of the project, we investigate the use of machine learning algorithms to detect and predict patterns in time series data. If we can predict the direction of an asset's value at a future date (an exact prediction would be very difficult with continuous data), we can make a bet today that the asset will move in that direction. Hopefully, by that future date, the asset has moved in the predicted direction, allowing us to capture a profit. In short, we are trying to find patterns in the history of the asset to help inform us about its future behaviour.

General Outline of Algo

For this project, we will attempt to create a Random Forest Classifier (RFC), trained on historical data of some asset's price. This classifier will take in the previous x days of data and output a prediction stating that the asset price will go higher, go lower, or is unknown. How do we train this classifier?

Since this is a supervised machine learning algorithm, we need to give the classifier a set of training data consisting of input feature vectors paired with known correct classifications. How do we construct these samples? One might think we should simply train the classifier on the previous asset prices, but this would not be effective. We want to generalize from the data, and using exact asset prices will only detect patterns in those exact prices. We want to capture the shape of the asset price, and to do so, we can use percent change.

So to construct a training example, we look at a day in the past, grab the asset price for the previous x days, and from there construct a vector, of length x-1, of forward percent changes, defined as (x[i+1] - x[i]) / x[i]. With this as input, we then look y days into the future and see whether the asset has gained or lost value. To help generalize the data, we should make sure the asset price has changed enough to warrant a categorization, so we check that the asset price moved further than some percentage.
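To make this concrete, here is a minimal sketch of building one training sample. The function name and structure are illustrative, not the project's actual code:

```python
import numpy as np

def make_sample(prices, x, y, percent_change):
    """Build one (features, label) pair from a list of daily closes.

    Illustrative sketch: `x` previous days become a percent-change
    vector, and the move `y` days later becomes the label.
    """
    window = np.asarray(prices[:x], dtype=float)
    # Forward percent changes: (p[i+1] - p[i]) / p[i], length x-1
    features = np.diff(window) / window[:-1]

    # Label from the move y days after the window ends
    future = prices[x - 1 + y]
    move = (future - window[-1]) / window[-1]
    if move > percent_change:
        label = 1        # price went meaningfully higher
    elif move < -percent_change:
        label = -1       # price went meaningfully lower
    else:
        label = 0        # move too small: 'unknown'
    return features, label
```

Using percent changes rather than raw prices means two windows with the same shape but different price levels produce the same feature vector, which is exactly the generalization the text describes.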

With the classifier trained, we can use it to predict what might happen in the future as we trade. We look at the past x days, construct a percent change vector, and ask the classifier for a prediction. Based on this prediction, we either short or long the asset. We then check back y days later and close out the position. If the prediction is right, we make a profit.
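The prediction step might be sketched as follows, where `trade_signal` is a hypothetical helper, `clf` stands for the trained classifier, and `recent_prices` holds the last x closes:

```python
import numpy as np

def trade_signal(clf, recent_prices):
    """Turn the last x closes into a percent-change vector and
    ask the classifier which way to trade.

    Illustrative sketch, not the Quantopian algorithm's code.
    """
    window = np.asarray(recent_prices, dtype=float)
    features = np.diff(window) / window[:-1]   # same transform as training
    return clf.predict([features])[0]          # 1 = go long, -1 = go short, 0 = hold
```

The key point is that the live features must be built with exactly the same transformation used on the training samples.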

Parameters

The start of the algorithm initializes some parameters we can tweak:


In [ ]:
context.params = dict((stock, {"years": 5, 
                                "historicalDays": 30, 
                                "predictionDays": 5,
                                "percentChange": .02,
                                "orderSize": 2000}) for stock in context.stocks)

These are default values for all stocks in the portfolio, but in the lines that follow, I assign custom values per SID. Like in Part 1, this allows me to tweak the training and classification on a per-asset basis. Much of this project was spent tweaking these parameters to get good results.

The percentChange parameter has a big effect on the classifier and on how often it comes up with an actual prediction (versus outputting 0 for 'unknown'). If it is set too high, the training labels will be essentially all zeros, and the classifier will rarely predict anything. Set it too low, and you may be creating training examples that are spurious or capture random noise. For most assets, a 2% change seemed like a good value to go with.

Years, which controls how much historical data we look at during training, also strongly affects the results. Originally set to 5 years, results for SPY improved after changing it to only 1 year. This might mean that we should generally favor the recent past over the distant past.

I considered playing with some of the parameters of the RFC itself, like n_estimators, criterion, max_depth, etc., but I did not find any significant improvement in doing so.
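For reference, this is roughly what exposing those parameters looks like; the values shown are illustrative, not tuned settings from the project:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative parameter values; in practice these tweaks did not
# yield significant improvements over the sklearn defaults.
clf = RandomForestClassifier(n_estimators=100,   # number of trees in the forest
                             criterion="gini",   # split-quality measure
                             max_depth=10,       # cap on tree depth
                             random_state=0)     # make a training run repeatable
```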

Quantopian Algo and Results

Location: https://www.quantopian.com/algorithms/555959481062a3bb90000185/555d3a0e27ca8b1057b9a3c1#algorithm

Many times the classifier does not give a prediction one way or the other, depending on the parameter values. This is good, as we do not want to trade too often, especially if the classifier is not confident in its output. One thing to investigate in the future would be classifier models that return not only an answer but a confidence score, and only trade when that score is above some threshold.
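As a sketch of that idea, scikit-learn's RandomForestClassifier already exposes the trees' vote shares via predict_proba, so a confidence threshold could be layered on top. The toy data and the 0.7 threshold below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy percent-change vectors standing in for real training data
X = np.array([[0.01, 0.02], [0.03, 0.01], [-0.02, -0.01], [-0.01, -0.03]])
y = np.array([1, 1, -1, -1])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def confident_prediction(clf, features, threshold=0.7):
    """Return a class label only when the forest's vote share
    clears the threshold; otherwise 0 for 'no trade'."""
    probs = clf.predict_proba([features])[0]   # vote share per class
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= threshold else 0
```

Here the vote share of the trees acts as the confidence score, so ambiguous inputs fall through to 0 and no position is taken.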

Interestingly enough, when debugging my code for a single asset, it sometimes seemed the algorithm did better when the sign of the classifier's output was flipped. That means a model can sometimes be so bad that it is better to do the exact opposite of what it predicts. However, due to the randomness of the algorithm, this was not consistent.

One thing to worry about is drift. Training a model on historical data is great, but we need to keep this model up to date as time goes on. To see if drift was a problem, I retrain the RFC model every year. It is tough to tell whether doing so helped across all parameters and assets, but it seems like a good idea generally speaking, so I kept it. One idea for the future would be to somehow weigh the earlier data less than the later data when training the classifier.
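That weighting idea could be sketched with the sample_weight argument to scikit-learn's fit, assuming the training rows are ordered oldest to newest; the toy data and the linear ramp are arbitrary illustrations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training set, rows ordered oldest to newest
X_train = np.array([[0.01], [0.02], [-0.01], [0.03]])
y_train = np.array([1, 1, -1, 1])

# Linear ramp: the oldest sample counts half as much as the newest.
# The 0.5-to-1.0 range is one arbitrary choice of weighting scheme.
weights = np.linspace(0.5, 1.0, len(X_train))

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```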

It would be nice to be able to save a model that performed well and reload it via pickle. Right now, since the RFC is random, it gives different results every time it is run. The code is at a point where the results are usually good (since I trade a portfolio where a few of the assets typically do well), but when working with a single asset, multiple runs can show a wide variance in ultimate returns.
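A minimal sketch of both fixes, with toy data and an illustrative filename: fixing random_state makes a given training run repeatable, and pickle lets a model whose backtest looked good be saved and restored later:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for real percent-change training samples
X, y = [[0.01], [0.02], [-0.01], [-0.02]], [1, 1, -1, -1]
clf = RandomForestClassifier(random_state=42).fit(X, y)

with open("rfc_model.pkl", "wb") as f:
    pickle.dump(clf, f)        # save the trained forest

with open("rfc_model.pkl", "rb") as f:
    restored = pickle.load(f)  # reload it later, e.g. in a new backtest
```

The restored model produces the same predictions as the original, which would remove the run-to-run variance described above.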


In [9]:
from IPython.display import Image
Image(filename='part2/Part2-BestResults.png', height="90%", width="90%")


Out[9]:

Other Future Ideas

  • Do more research on how to tweak the params sent to the RFC, like the number of trees, etc.
  • Can we combine training samples from more than one asset if they share similar movements?
  • Would stop-loss ordering help minimize the downside risk?
  • Try PyBrain to do the above with a neural network classifier